Performance of Single-Processor BLAS on IBM p690

Author

  • Inge Gutheil
Abstract

The Basic Linear Algebra Subprograms (BLAS) are the basic computational kernels in most applications. BLAS 1 and BLAS 2, the vector-vector and matrix-vector routines, require a number of memory accesses of the same order as the number of computations and thus cannot achieve performance close to peak on modern computer architectures. BLAS 3 matrix-matrix operations on n × n matrices, on the other hand, can perform order n^3 operations with only order n^2 memory accesses. This much better ratio of computation to memory access allows for much higher performance. To show what performance can be expected from the BLAS routines in IBM's ESSL on an IBM p690, we investigated the performance of one routine from each BLAS level and compared it with that of the corresponding routines on a CRAY T3E.
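
The ratio argument in the abstract can be made concrete with a rough operation count. The sketch below is illustrative only: the abstract does not name the routines that were measured, so AXPY, GEMV, and GEMM are assumed here as representatives of the three levels, and the function name `blas_intensity` is hypothetical. It counts floating-point operations and words of memory touched, ignoring caches and any blocking done by a real library such as ESSL.

```python
def blas_intensity(n):
    """Flops per memory word for one assumed representative routine per BLAS level.

    The counts are textbook estimates, not measurements from the paper.
    """
    counts = {
        # Level 1, y <- a*x + y: 2n flops; read x, read y, write y -> 3n words
        "level1_axpy": (2 * n, 3 * n),
        # Level 2, y <- A*x + y: 2n^2 flops; n^2 matrix words plus 3n vector words
        "level2_gemv": (2 * n * n, n * n + 3 * n),
        # Level 3, C <- A*B + C: 2n^3 flops; read A, B, C and write C -> 4n^2 words
        "level3_gemm": (2 * n ** 3, 4 * n * n),
    }
    return {name: flops / words for name, (flops, words) in counts.items()}


if __name__ == "__main__":
    for name, ratio in blas_intensity(1000).items():
        print(f"{name}: {ratio:.3f} flops per word")
```

Under these assumptions the level 1 and level 2 ratios stay bounded by a small constant as n grows, while the level 3 ratio grows linearly with n, which is why only BLAS 3 can be expected to approach peak performance.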


Similar resources

Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology

The POWER4-based p690 systems offer the highest performance of the IBM eServer pSeries line of computers. Within the general-purpose UNIX server market, they also offer the highest levels of concurrent error detection, fault isolation, recovery, and availability. High availability is achieved by minimizing component failure rates through improvements in the base technology, and through design t...


L2-Cache Miss Profiling on the p690 for a Large-scale Database Application

This paper profiles L2-cache data-load misses generated by the TPC-C benchmark executed on 8- and 32-way configurations of the IBM eServer pSeries 690 (p690). Using sampled performance monitor event traces, the resolution sites of L2-cache data-load misses are identified. To determine ways to enhance performance, the heavily hit resolution sites, L3 caches and main memory, are studied with respec...


Memory Performance Profiling via Sampled Performance Monitor Event Traces

Memory performance can be studied, process behavior can be characterized, and application performance can be improved through the use of sampled performance monitor event traces. As an example, this paper demonstrates how sampled traces of the TPC-C benchmark executed on eight- and 32-processor configurations of the IBM eServer pSeries 690 (p690) are analyzed to identify the resolution sites of l...


Organization and implementation of the register-renaming mapper for out-of-order IBM POWER4 processors

We present a new nonconventional approach for designing and organizing register rename mappers that can be applied in modern out-of-order processor chips. A content-addressable memory (CAM) configuration optimal for such a register mapper application was developed. The structure of the CAM and search engine, described in this paper, facilitates the implementation of the register mapper as a gro...


Comparing Linux Clusters for the Community Climate System Model

In this paper, we examine the performance of two components of the NCAR Community Climate System Model (CCSM) executing on clusters with a variety of microprocessor architectures and interconnects. Specifically, we examine the execution time and scalability of the Community Atmospheric Model (CAM) and the Parallel Ocean Program (POP) on Linux clusters with Intel Xeon and AMD Opteron processors,...




Publication date: 2004